Fix flaky unit tests #2904

cblecker · 2022-12-04T22:33:38Z

Description of the change:
This fixes two flaky unit tests: TestResolver and TestUpdates.

First, TestResolver, specifically the FailForwardEnabled/3EntryReplacementChain/ReplacementChainBroken/NotSatisfiable test case.

This test was changed in #2788 and currently references one of the three failed CSVs in the test namespace.

However, when you run this test multiple times, the resolver cache sometimes hands back a different CSV. (For the output below, I deliberately broke the match so that the test result is always printed.)

$ go test -count=10 -run TestResolver ./pkg/controller/registry/resolver | grep 'failed to populate resolver cache from source'
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"

Note that the resolver sometimes returns catsrc-namespace/a.v1 and sometimes catsrc-namespace/a.v2.

As I understand this test, the specific CSV doesn't matter for the error, as all three are in the CSVPhaseFailed state. Therefore, my proposed fix removes the CSV name from the error match, so that it matches on any of the three CSVs. I believe this was the original intent of the test: prior to #2788, the error matching was split in two, based on the same omission of the CSV name.
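To illustrate the idea of matching on the error without the CSV name, here is a minimal sketch. The helper `matchesAnyFailedCSV` is hypothetical and is not the assertion used in the actual test; the message strings are taken from the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesAnyFailedCSV is a hypothetical helper (not part of the OLM test suite).
// It checks the fixed text before and after the CSV name, so the result does
// not depend on which failed CSV the resolver cache reports.
func matchesAnyFailedCSV(msg string) bool {
	const prefix = "failed to populate resolver cache from source @existing/catsrc-namespace: csv"
	const suffix = "in phase Failed instead of Replacing"
	return strings.Contains(msg, prefix) && strings.Contains(msg, suffix)
}

func main() {
	// The two CSV names seen in the flaky output above.
	for _, name := range []string{"a.v1", "a.v2"} {
		msg := fmt.Sprintf("failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/%s in phase Failed instead of Replacing", name)
		fmt.Println(name, matchesAnyFailedCSV(msg))
	}
}
```

Both variants of the message pass the check, which is the property the flaky assertion was missing.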

Second, TestUpdates. In this piece of code:

operator-lifecycle-manager/pkg/controller/operators/olm/operator_test.go

Lines 3925 to 3929 in 6ffec4d

    
           for current.Status.Phase != e.whenIn.phase { 
        
           	fmt.Printf("waiting for (when) %s to be %s\n", e.whenIn.name, e.whenIn.phase) 
        
           	csvsToSync = syncCSVs(csvsToSync, deletedCSVs(e.shouldBe)) 
        
           	current = csvsToSync[e.whenIn.name] 
        
           }

The for loop keeps hammering the fake operator with op.syncClusterServiceVersion(csv) and Get calls as fast as it possibly can. However, this creates a race condition in the *RaceFreeFakeWatcher that is watching the fake OperatorGroup. Basically, the watch channel fills up more quickly than the operator can drain it. (The default watch channel length is 100 events, and if it fills up and closes, the goroutine panics.)
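The failure mode described above can be sketched with a plain buffered channel. This is an illustration only, under the assumption that events are published with a non-blocking send; `publish` and `fillUntilFull` are hypothetical names and this is not the watcher's actual code.

```go
package main

import "fmt"

// publish attempts a non-blocking send and reports whether the event fit
// into the buffer. Hypothetical stand-in for a producer writing watch events.
func publish(events chan<- int, ev int) bool {
	select {
	case events <- ev:
		return true
	default:
		// Buffer is full and nobody is draining it.
		return false
	}
}

// fillUntilFull sends events with no consumer running and returns how many
// were accepted before the buffer filled up.
func fillUntilFull(size int) int {
	events := make(chan int, size)
	accepted := 0
	for publish(events, accepted) {
		accepted++
	}
	return accepted
}

func main() {
	// With the buffer size of 100 mentioned above, the 101st event has nowhere to go.
	fmt.Println(fillUntilFull(100))
}
```

A producer that runs in a tight loop reaches this point whenever the consumer gets less CPU time than it needs to keep up.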

The proposed fix here adds a sleep of 1ms between iterations of this for loop. It slows the test down only negligibly, but it's enough to help the watch channel drain faster than it fills. A real-world Kubernetes API server isn't going to respond that fast anyway.

Motivation for the change:
Fix flaky unit tests, because they are the worst.

Architectural changes:

Testing remarks:

With the fix, I ran each test 100 times to verify that the flake is gone:

$ go test -count=100 -run TestResolver ./pkg/controller/registry/resolver 
ok      github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver       391.615s
$ go test -count=100 -run TestUpdates ./pkg/controller/operators/olm
ok      github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/olm   13.768s

Compare this with the current HEAD, where the test fails somewhere between 10% and 20% of the time.

Reviewer Checklist

openshift-ci · 2022-12-04T22:33:49Z

Hi @cblecker. Thanks for your PR.

I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cblecker · 2022-12-04T22:34:24Z

/cc @exdx @grokspawn
This is the flake that I ran into on #2903 :)

Signed-off-by: Christoph Blecker <cblecker@redhat.com>

grokspawn · 2022-12-05T16:35:17Z

/ok-to-test

ankitathomas

/lgtm

openshift-ci · 2022-12-06T13:36:39Z

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: ankitathomas, cblecker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from gallettilance and kevinrizza December 4, 2022 22:33

openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2022

openshift-ci bot requested review from exdx and grokspawn December 4, 2022 22:34

cblecker force-pushed the fix-flake branch from 0465f9b to 7fa9ade Compare December 4, 2022 22:37

cblecker changed the title ~~Fix flaky TestResolver unit test~~ Fix flaky unit tests Dec 5, 2022

cblecker added 2 commits December 5, 2022 08:00

Fix flaky TestResolver unit test

3bcb36a

Signed-off-by: Christoph Blecker <cblecker@redhat.com>

Fix flaky TestUpdates unit test

0084d61

Signed-off-by: Christoph Blecker <cblecker@redhat.com>

cblecker force-pushed the fix-flake branch from cf628ab to 0084d61 Compare December 5, 2022 16:00

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2022

ankitathomas approved these changes Dec 5, 2022

openshift-ci bot assigned ankitathomas Dec 5, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2022

grokspawn added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2022

openshift-merge-robot merged commit 4da774f into operator-framework:master Dec 6, 2022

cblecker deleted the fix-flake branch December 6, 2022 15:59
